Add network recovery and stream stall handling for chat #455

Open
DHCross wants to merge 9 commits into main from claude/fix-raven-timeout-EBHKy

Conversation

Owner

@DHCross DHCross commented Apr 26, 2026

Summary

This PR enhances the chat system's resilience by implementing network recovery mechanisms and stream stall detection. It adds retry logic with exponential backoff for network failures, online signal detection, and server-side stream timeout handling with user-facing fallback messages.

Key Changes

  • Network Recovery in useOracleChat:

    • Added NETWORK_RETRY_DELAYS_MS with exponential backoff delays (350ms, 900ms, 1800ms)
    • Implemented waitForOnlineSignal() to detect when the browser comes back online
    • Refactored request logic into performRequestWithNetworkRecovery() that retries with both primary and fallback (compact) history
    • Added inflightAssistantRef to distinguish pre-stream failures (show error bubble) from post-stream failures (log only)
  • Stream Stall Detection in raven-chat route:

    • Added overall stream timeout (streamOverallMs) and idle timeouts (streamFirstChunkIdleMs, streamChunkIdleMs)
    • Implemented dual-timer system: overall timeout and per-chunk idle detection
    • Cancels stalled streams and sets streamStalled flag for integrity checking
  • Generation Integrity Updates:

    • Added STREAM_STALL_FINISH_REASON constant for "stream_stall" detection
    • Added "Clouded Skies" fallback message for incomplete generations due to stream stalls
    • Updated resolveReplyIntegrity() to handle stream stall scenarios with appropriate user messaging
  • UI Enhancements:

    • Added handleCounterpartSymbolicMoment() callback in main App component
    • Extended ProfileVault with onRunSymbolicMoment prop
    • Added onSymbolicMoment handler to CounterpartQuickActions component
  • Telemetry & Testing:

    • Updated creator mirror offline telemetry handling to allow creator-mode help after acknowledgment
    • Added test cases for symbolic moment offline telemetry and stream stall fallbacks
    • Updated runtime limits configuration for stream timeout values

Implementation Details

The network recovery uses a candidate-based retry strategy: it attempts the primary history first, and if network errors occur, falls back to compact history. The retry loop respects the exponential backoff delays and waits for online signals between attempts.
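
A minimal sketch of the candidate-based retry loop described above, under stated assumptions. Only NETWORK_RETRY_DELAYS_MS and its three values come from this PR; requestWithNetworkRecovery, RecoveryDeps, and the injected sleep, waitForOnline, and isNetworkError helpers are hypothetical names, and the ordering (try every candidate, then back off, then wait for the online signal) is one plausible reading of the description rather than the code in useOracleChat.

```typescript
// Sketch only: everything except NETWORK_RETRY_DELAYS_MS is a hypothetical name.
const NETWORK_RETRY_DELAYS_MS: readonly number[] = [350, 900, 1800];

type History = string[];

interface RecoveryDeps {
  sleep: (ms: number) => Promise<void>; // backoff wait
  waitForOnline: () => Promise<void>; // resolves once the browser reports online
  isNetworkError: (err: unknown) => boolean; // e.g. the TypeError thrown by fetch
}

async function requestWithNetworkRecovery<T>(
  send: (history: History) => Promise<T>,
  candidates: History[], // assumed order: [primaryHistory, compactHistory]
  deps: RecoveryDeps,
): Promise<T> {
  let lastError: unknown = new Error("No request candidates were provided.");
  // One round per delay, plus a final round with no wait after it.
  const waits: Array<number | null> = [...NETWORK_RETRY_DELAYS_MS, null];
  for (const wait of waits) {
    for (const history of candidates) {
      try {
        return await send(history);
      } catch (err) {
        // Only network failures are retried; anything else surfaces immediately.
        if (!deps.isNetworkError(err)) throw err;
        lastError = err;
      }
    }
    if (wait !== null) {
      await deps.sleep(wait);
      await deps.waitForOnline();
    }
  }
  throw lastError;
}
```

Injecting the timing and connectivity helpers keeps the loop testable without real timers or a browser; in the hook they would presumably wrap setTimeout and the window "online" event.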

Stream stall detection uses a two-tier timeout approach: an overall timeout for the entire stream operation and idle timeouts that reset after each chunk is received. This prevents both hung connections and slow-streaming scenarios from blocking indefinitely.
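
The two-tier timeout can be sketched roughly as follows. The limit names streamOverallMs, streamFirstChunkIdleMs, and streamChunkIdleMs and the streamStalled flag come from this PR; drainWithWatchdogs, ChunkReader, and ReadResult are hypothetical names, and the real route likely forwards chunks to the client instead of collecting them in an array.

```typescript
// Sketch only: ChunkReader mirrors the read()/cancel() part of a stream reader.
type ReadResult<T> = { done: boolean; value?: T };

interface ChunkReader<T> {
  read(): Promise<ReadResult<T>>;
  cancel(): Promise<void>;
}

interface StallLimits {
  streamOverallMs: number;
  streamFirstChunkIdleMs: number;
  streamChunkIdleMs: number;
}

const STALL = "stall" as const;

async function drainWithWatchdogs<T>(
  reader: ChunkReader<T>,
  limits: StallLimits,
): Promise<{ chunks: T[]; streamStalled: boolean }> {
  const chunks: T[] = [];
  // Tier 1: a single timer that bounds the whole stream.
  let overallTimer: ReturnType<typeof setTimeout> | undefined;
  const overall = new Promise<typeof STALL>((resolve) => {
    overallTimer = setTimeout(() => resolve(STALL), limits.streamOverallMs);
  });
  try {
    for (;;) {
      // Tier 2: an idle timer that is re-armed for every read.
      const idleMs =
        chunks.length === 0 ? limits.streamFirstChunkIdleMs : limits.streamChunkIdleMs;
      let idleTimer: ReturnType<typeof setTimeout> | undefined;
      const idle = new Promise<typeof STALL>((resolve) => {
        idleTimer = setTimeout(() => resolve(STALL), idleMs);
      });
      let outcome: ReadResult<T> | typeof STALL;
      try {
        outcome = await Promise.race([reader.read(), idle, overall]);
      } finally {
        clearTimeout(idleTimer);
      }
      if (outcome === STALL) {
        // Cancel the stalled stream and report it to the caller.
        await reader.cancel().catch(() => {});
        return { chunks, streamStalled: true };
      }
      if (outcome.done) return { chunks, streamStalled: false };
      if (outcome.value !== undefined) chunks.push(outcome.value);
    }
  } finally {
    clearTimeout(overallTimer);
  }
}
```

Racing each read against both timers means a slow but steady stream is still bounded by the overall limit, while a silent stream is cut off by the shorter idle limit.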

https://claude.ai/code/session_018fAnZmYcnz8i9bqS3Nsjn6


vercel Bot commented Apr 26, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
shipyard Error Error Apr 27, 2026 3:25am

Contributor

Copilot AI left a comment


Pull request overview

This PR strengthens Raven chat resilience by adding client-side network retry/backoff, server-side upstream stream stall watchdogs, and integrity-layer messaging for stalled/incomplete generations, while also exposing “Symbolic Moment” as a quick action across the UI.

Changes:

  • Added network recovery (retry + online-signal waiting + fallback history) and client-side chunk read timeouts in useOracleChat.
  • Added server-side stream stall detection timers and “Clouded Skies” labeling for stalled/incomplete generations, plus runtime limit knobs.
  • Extended UI/UX to trigger Symbolic Moment reads from counterpart quick actions and the Profile Vault, with corresponding tests/telemetry updates.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 2 comments.

File Description
vessel/src/test/polyadic-chat-routing.test.ts Updates routing/UI label assertions for new “Moment” quick action strings.
vessel/src/lib/ravenPersona.ts Adds runtime limits for stream overall + idle watchdog timeouts via env-configurable values.
vessel/src/hooks/useOracleChat.ts Implements network retry/backoff, online detection, client chunk timeouts, and improved error/ghost handling.
vessel/src/components/chat/CounterpartQuickActions.tsx Adds onSymbolicMoment action and button in counterpart quick actions UI.
vessel/src/components/ProfileVault.tsx Adds “Read Symbolic Moment” buttons wired via optional callback prop.
vessel/src/app/page.tsx Wires Symbolic Moment quick actions into the main app flows and telemetry.
vessel/src/app/api/raven-chat/userBlockBuilder.ts Adjusts creator-mirror offline telemetry prompt to acknowledge offline state then proceed with creator help.
vessel/src/app/api/raven-chat/route.ts Adds upstream provider stream stall/idle watchdog timers and flags stall finish reasons.
vessel/src/app/api/raven-chat/generationIntegrity.ts Adds “stream_stall” labeling (“Clouded Skies”) for incomplete-generation fallbacks.
vessel/src/app/api/raven-chat/tests/userBlockBuilder.test.ts Adds coverage for creator-mirror offline telemetry behavior.
vessel/src/app/api/raven-chat/tests/generationIntegrity.test.ts Adds coverage for “Clouded Skies” labeling when finish reason is stream_stall.
vessel/sherlog-velocity/data/self-model.json Updates Sherlog velocity self-model artifact snapshot metadata/content.
vessel/sherlog-velocity/data/gap-history.jsonl Appends new Sherlog velocity gap-history entry.
Comments suppressed due to low confidence (1)

vessel/src/hooks/useOracleChat.ts:1738

  • The 413 compact-history retry currently bypasses the new network recovery loop and does a single performRequest(compactHistoryBase). If the retry hits a transient network failure, it won't get the exponential-backoff/online-signal recovery behavior introduced above. Consider routing this retry through performRequestWithNetworkRecovery() as well (or otherwise reusing the same recovery policy).
        if (response.status === 413 && !didCompactRetry && historyBase.length > compactHistoryBase.length) {
            didCompactRetry = true;
            response = await performRequest(compactHistoryBase);
        }
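
One way to read this suggestion, as a hedged sketch: send both the first attempt and the 413 retry through the same recovery wrapper. The names historyBase, compactHistoryBase, and performRequestWithNetworkRecovery come from the PR, but the wrapper's real signature is not shown here, so the single-argument form below, ChatResponse, and sendWithCompactRetry are assumptions for illustration.

```typescript
// Sketch only: the wrapper is passed in so its real signature stays an open question.
type History = string[];

interface ChatResponse {
  status: number;
}

async function sendWithCompactRetry(
  historyBase: History,
  compactHistoryBase: History,
  performRequestWithNetworkRecovery: (history: History) => Promise<ChatResponse>,
): Promise<ChatResponse> {
  let response = await performRequestWithNetworkRecovery(historyBase);
  // Retry once with the compact history, but only if it is actually smaller.
  if (response.status === 413 && historyBase.length > compactHistoryBase.length) {
    // The retry goes through the same recovery policy as the first attempt.
    response = await performRequestWithNetworkRecovery(compactHistoryBase);
  }
  return response;
}
```

Because the function runs once per send, the didCompactRetry flag from the excerpt is not needed in this form.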

Comment on lines +100 to +115
const STREAM_STALL_FINISH_REASON = "stream_stall";
const CLOUDED_SKIES_LABEL =
"Clouded Skies — the channel went quiet before a full reading locked. The geometry is intact; only the rendering paused.";

function isStreamStall(finishReason?: string | null): boolean {
  return (finishReason || "").trim().toLowerCase() === STREAM_STALL_FINISH_REASON;
}

function buildIncompleteReply(
  input: ReplyIntegrityInput,
  reason: string,
): string {
  const stallPrefix = isStreamStall(input.providerFinishReason)
    ? `${CLOUDED_SKIES_LABEL}\n\n`
    : "";


Copilot AI Apr 27, 2026


isStreamStall() is only used to prefix the incomplete fallback text. If a stream stalls after emitting some partial text, resolveReplyIntegrity() will currently treat the reply as generated (since trimmed is non-empty and it's not considered truncated), so users may see an incomplete/partial answer without the intended stall fallback. Consider treating providerFinishReason === 'stream_stall' (and/or providerCompleted === false) as an incomplete generation condition so the integrity layer reliably withholds partial stalled output.
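
A minimal sketch of the check this comment proposes, assuming a reduced input shape. isStreamStall and STREAM_STALL_FINISH_REASON are copied from the diff above; the text field, the trimmed ReplyIntegrityInput, and shouldTreatAsIncomplete are hypothetical, and how resolveReplyIntegrity() would consume the result is not shown in this PR excerpt.

```typescript
// Hypothetical, reduced shape: the real ReplyIntegrityInput is not shown here.
interface ReplyIntegrityInput {
  text: string;
  providerFinishReason?: string | null;
  providerCompleted?: boolean;
}

const STREAM_STALL_FINISH_REASON = "stream_stall";

function isStreamStall(finishReason?: string | null): boolean {
  return (finishReason || "").trim().toLowerCase() === STREAM_STALL_FINISH_REASON;
}

// Assumed helper: true when the integrity layer should withhold the text
// and emit the incomplete-generation fallback instead.
function shouldTreatAsIncomplete(input: ReplyIntegrityInput): boolean {
  if (input.text.trim().length === 0) return true; // nothing was generated
  if (isStreamStall(input.providerFinishReason)) return true; // partial text after a stall
  return input.providerCompleted === false; // provider never signalled completion
}
```

With a check like this, partial text that arrived before a stall would no longer pass as a finished reply.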

Copilot uses AI. Check for mistakes.
Comment thread vessel/src/hooks/useOracleChat.ts Outdated
Comment on lines +1792 to +1796
timer = setTimeout(() => {
  reader.cancel().catch(() => {});
  reject(new Error('Raven stream stalled before any chunk arrived.'));
}, ms);
}),

Copilot AI Apr 27, 2026


The timeout rejection error message is always "Raven stream stalled before any chunk arrived." even when assistantMessageText is already non-empty (i.e., this is a mid-stream idle timeout). This makes the surfaced error misleading and harder to diagnose. Adjust the message (or error type) based on whether this was a first-chunk timeout vs. a subsequent idle timeout.
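
A small sketch of how the two timeout cases could be told apart, assuming the timeout callback can see the text accumulated so far. assistantMessageText and the first-chunk message come from the PR; buildStallError, the mid-stream wording, and the two error names are illustrative assumptions.

```typescript
// Sketch only: helper name, mid-stream message, and error names are assumptions.
type StallPhase = "first-chunk" | "mid-stream";

function buildStallError(assistantMessageText: string, idleMs: number): Error {
  // Any accumulated text means the stream stalled after it had started.
  const phase: StallPhase = assistantMessageText.length > 0 ? "mid-stream" : "first-chunk";
  const message =
    phase === "first-chunk"
      ? "Raven stream stalled before any chunk arrived."
      : `Raven stream stalled mid-response: no new chunk for ${idleMs}ms.`;
  const error = new Error(message);
  // A distinct name lets logs and catch blocks separate the two cases.
  error.name = phase === "first-chunk" ? "StreamFirstChunkTimeout" : "StreamIdleTimeout";
  return error;
}
```

The timeout callback would then call reject(buildStallError(assistantMessageText, ms)) in place of the fixed message.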

Copilot AI and others added 8 commits April 27, 2026 01:53
…s-for-symbolic-moment

# Conflicts:
#	vessel/src/lib/raven/symbolicMomentFrontstage.ts

Co-authored-by: DHCross <45954119+DHCross@users.noreply.github.com>
…lic-moment

Introduce deterministic phrasing variants for symbolic moment replies
… errors

When a Symbolic Moment read stalled mid-stream, three defects compounded:

- The Gemini reader had no per-chunk watchdog, so a stalled stream hung
  until Vercel's maxDuration killed the route, bypassing the labeled
  fallbacks in resolveReplyIntegrity.
- The client inserted an empty assistant bubble on the first DATA frame,
  leaving a ghost behind whenever no CHUNK followed.
- The outer try/catch wrapped post-stream enrichment, so a vault-sync or
  blind-mirror throw appended a generic "channel issue" bubble even when
  the reply had already streamed in.

Server: add idle (60s first-chunk / 30s subsequent) + overall (170s)
watchdogs around the Gemini reader. On stall, cancel the reader and set
providerFinishReason='stream_stall' so the existing integrity pipeline
emits GENERATION_INCOMPLETE with a Clouded Skies fallback instead of
a hung connection.

Client: defer assistant-bubble insertion until the first non-empty CHUNK,
add a 75s/45s defense-in-depth read timeout, and harden enrichment
(applyVaultAction, materializeAssistantBlindMirror, extractCheckpointQuestion)
in local try/catch blocks. Track in-flight render state in a ref so the
outer catch can suppress the redundant error bubble when the reply
already rendered. Defensively scrub any empty-text ghost on error.

Adds RAVEN_STREAM_FIRST_CHUNK_MS / RAVEN_STREAM_CHUNK_IDLE_MS /
RAVEN_STREAM_OVERALL_MS env knobs and three new integrity tests
covering the Clouded Skies stall path.
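
For illustration, the env knobs could be resolved along these lines. The three variable names and the 60s / 30s / 170s values come from this commit message; readMs, resolveStreamLimits, the mapping onto the streamFirstChunkIdleMs / streamChunkIdleMs / streamOverallMs fields, and the fall-back-on-invalid-input behaviour are assumptions, not the code in ravenPersona.ts.

```typescript
// Sketch only: parsing rules and helper names are assumptions.
type Env = Record<string, string | undefined>;

function readMs(env: Env, key: string, fallbackMs: number): number {
  const parsed = Number.parseInt(env[key] ?? "", 10);
  // Ignore missing, non-numeric, zero, or negative values.
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

function resolveStreamLimits(env: Env) {
  return {
    streamFirstChunkIdleMs: readMs(env, "RAVEN_STREAM_FIRST_CHUNK_MS", 60_000),
    streamChunkIdleMs: readMs(env, "RAVEN_STREAM_CHUNK_IDLE_MS", 30_000),
    streamOverallMs: readMs(env, "RAVEN_STREAM_OVERALL_MS", 170_000),
  };
}
```

On the server this would typically be called as resolveStreamLimits(process.env).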

https://claude.ai/code/session_018fAnZmYcnz8i9bqS3Nsjn6
@sonarqubecloud

Quality Gate failed

Failed conditions
20.3% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

4 participants